House prices report

This document is a data science report on the Kaggle House Prices tutorial project. It was generated using the Shapash library.

General Information

Version : 0.7

Name : House Prices Prediction Project

Purpose : Predicting the sale price of houses

Date : 2021-04-07

Contributors : Yann Golhen, Sebastien Bidault, Thomas Bouche, Guillaume Vignal, Thibaud Real

Description : This work is a data science project that tries to predict the sale price of houses based on 79 explanatory variables. It was designed within the data science team at X. and has been improved since the beginning of the project in 2019. The model has been in production since February 2021.


Dataset Information

Origin : The Assessor’s Office

Description : the sale of individual residential property in Ames, Iowa

Depth : from 2006 to 2010

Perimeter : only residential sales

Target Variable : SalePrice

Target Description : The property's sale price in dollars


Data Preparation

Variable Filtering : All variables that required special knowledge or previous calculations for their use were removed

Individual Filtering : only the most recent sales data on any property were kept (for houses that were sold multiple times during this period)

Missing Values : were replaced by 0
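
The individual filtering and missing value steps above can be sketched with pandas. This is a minimal illustration on toy data, not the project's actual code: the column name PropertyId and the use of YrSold to decide which sale is most recent are assumptions made for the example, and the values are invented.

```python
import pandas as pd

# Toy data. "PropertyId" is a hypothetical identifier column; the values are invented.
sales = pd.DataFrame({
    "PropertyId": [1, 1, 2, 3],
    "YrSold": [2006, 2009, 2007, 2010],
    "LotFrontage": [65.0, None, 80.0, None],
    "SalePrice": [150000, 172000, 181500, 140000],
})

# Individual filtering: keep only the most recent sale of each property.
latest = (
    sales.sort_values("YrSold")
    .drop_duplicates(subset="PropertyId", keep="last")
    .sort_values("PropertyId")
    .reset_index(drop=True)
)

# Missing values: replace every missing value by 0.
prepared = latest.fillna(0)
print(prepared)
```

Replacing missing values by 0 is simple, but it makes a missing measurement indistinguishable from a true zero, which is worth keeping in mind when reading the univariate statistics.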


Features Engineering

Created Variables : No feature was created. All features are taken directly from the Kaggle dataset

Transformed Variables : Categorical features were transformed using an ordinal encoder
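
A possible way to apply an ordinal encoding is sketched below. The report does not say which encoder implementation was used, so scikit-learn's OrdinalEncoder is an assumption here, and the small data frame is invented for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data: one categorical and one numeric feature (values are invented).
X = pd.DataFrame({
    "Neighborhood": ["NAmes", "CollgCr", "NAmes", "OldTown"],
    "1stFlrSF": [856, 1262, 920, 961],
})
categorical_cols = ["Neighborhood"]

# Each category is mapped to an integer code (categories are sorted alphabetically).
encoder = OrdinalEncoder()
codes = encoder.fit_transform(X[categorical_cols])

X_encoded = X.copy()
for i, col in enumerate(categorical_cols):
    X_encoded[col] = codes[:, i]

print(X_encoded)
```

The integer codes carry an arbitrary order. Tree-based models such as a random forest are usually tolerant of this, which is one reason the encoding is a common choice for them.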


Model information

Model used : RandomForestRegressor

Library : sklearn.ensemble._forest

Library version : 0.23.2

Model parameters :

Parameter key             Parameter value
base_estimator            DecisionTreeRegressor()
n_estimators              50
estimator_params          ('criterion', 'max_depth', 'min_samples_split', 'min_samples_leaf', 'min_weight_fraction_leaf', 'max_features', 'max_leaf_nodes', 'min_impurity_decrease', 'min_impurity_split', 'random_state', 'ccp_alpha')
bootstrap                 True
oob_score                 False
n_jobs                    None
random_state              None
verbose                   0
warm_start                False
class_weight              None
max_samples               None
criterion                 mse
max_depth                 None
min_samples_split         2
min_samples_leaf          1
min_weight_fraction_leaf  0.0
max_features              auto
max_leaf_nodes            None
min_impurity_decrease     0.0
min_impurity_split        None
ccp_alpha                 0.0
n_features_in_            72
n_features_               72
n_outputs_                1
base_estimator_           DecisionTreeRegressor()
estimators_               [DecisionTreeRegressor(max_features='auto', random_state=1844014871), DecisionTreeRegressor(max_features='auto', random_state=79755033), DecisionTreeRegressor(max_features='auto', random_state=1237390286), DecisionTreeRegressor(max_features='auto', random_state=675509169),...
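
For readers who want to reproduce a comparable model, the sketch below fits a RandomForestRegressor with 50 trees and every other parameter left at its default, which matches the values listed above for scikit-learn 0.23.2. The training data is synthetic and only there to make the example runnable. The values criterion='mse' and max_features='auto' are deliberately not passed explicitly, because later scikit-learn releases renamed or removed them.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data standing in for the 72 prepared features (assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# Only n_estimators differs from the library default; random_state is left at None,
# as in the report, so two fits will generally produce different trees.
model = RandomForestRegressor(n_estimators=50)
model.fit(X, y)

params = model.get_params()
print(params["n_estimators"], params["bootstrap"], len(model.estimators_))
```

Because random_state is None, refitting the model does not give identical predictions. Fixing a seed would be needed for an exactly reproducible report.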

Dataset analysis

Global analysis

                        Training dataset    Prediction dataset
number of features      72                  72
number of observations  1095                365
missing values          0                   0
% missing values        0                   0
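
The observation counts above correspond to a 75% / 25% partition of 1460 rows (1095 + 365). The report does not state how the split was made, so the sketch below is only one way to obtain the same sizes, using scikit-learn's train_test_split. The random_state value is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_total = 1095 + 365  # total number of observations reported above
indices = np.arange(n_total)

# Hold out 25% of the rows as the prediction (test) dataset.
train_idx, test_idx = train_test_split(indices, test_size=0.25, random_state=0)

print(len(train_idx), len(test_idx))
```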

Univariate analysis

1stFlrSF - Numeric

First Floor square feet
        Training dataset    Prediction dataset
count   1095                365
mean    1180                1120
std     400                 341
min     334                 483
25%     886                 864
50%     1100                1050
75%     1420                1320
max     4690                2630
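
A side-by-side summary like the one above can be produced with pandas describe(). The sketch below uses a handful of invented values rather than the real data. It only shows the shape of the computation and is not necessarily how Shapash builds the table.

```python
import pandas as pd

# Invented values standing in for the training and prediction datasets.
train = pd.DataFrame({"1stFlrSF": [856, 1262, 920, 961, 1145]})
test = pd.DataFrame({"1stFlrSF": [796, 1694, 1107]})

# describe() returns count, mean, std, min, quartiles and max for a numeric column.
summary = pd.concat(
    [train["1stFlrSF"].describe(), test["1stFlrSF"].describe()],
    axis=1,
    keys=["Training dataset", "Prediction dataset"],
)
print(summary)
```

Comparing the two columns is a quick check for distribution drift between the datasets. In the table above, for example, the training maximum (4690) is well above the prediction maximum (2630).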

Target analysis

SalePrice - Numeric

        Training dataset    Prediction dataset
count   1095                365
mean    182000              177000
std     78500               82000
min     34900               40000
25%     130000              126000
50%     165000              160000
75%     215000              205000
max     755000              745000

Model explainability

Note : the explainability graphs were generated using the test set only.

Global feature importance plot

Features contribution plots

1stFlrSF -

First Floor square feet

Model performance

Univariate analysis of target variable

SalePrice - Numeric

        True values    Prediction values
count   365            365
mean    177000         176000
std     82000          68200
min     40000          77400
25%     126000         129000
50%     160000         160000
75%     205000         199000
max     745000         480000

Metrics

Mean absolute error : 16773.08

Mean squared error : 775516588.1
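
The mean absolute error is expressed in dollars, while the mean squared error is in squared dollars, which makes the two hard to compare directly. Taking the square root of the reported mean squared error gives a root mean squared error of roughly 27,848 dollars. The sketch below shows that conversion and, on three invented predictions, how both metrics are defined. The helper functions are written for illustration and are not the report's own code.

```python
import math

def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Root mean squared error derived from the value reported above.
reported_mse = 775516588.1
rmse = math.sqrt(reported_mse)
print(round(rmse, 1))

# Invented sale prices and predictions, only to show the two definitions at work.
y_true = [200000.0, 150000.0, 310000.0]
y_pred = [190000.0, 165000.0, 300000.0]
toy_mae = mean_absolute_error(y_true, y_pred)
toy_mse = mean_squared_error(y_true, y_pred)
print(toy_mae, toy_mse)
```

A root mean squared error noticeably larger than the mean absolute error (16,773.08 dollars here) usually points to a few large individual errors. That would be consistent with the prediction maximum (480000) falling well short of the true maximum (745000) in the table above.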